
Conversation

@pkoutsovasilis
Contributor

@pkoutsovasilis pkoutsovasilis commented Nov 20, 2025

Overview

This PR implements support for multiple StackConfigPolicies (SCPs) targeting the same Elasticsearch cluster or Kibana instance, using a weight-based priority system for deterministic policy composition.

Key Features

Weight-Based Priority System

  • Policies are merged in order of weight (lower weight takes precedence)
  • Default weight: 0
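
To make the ordering concrete, here is a minimal Go sketch, assuming policies are applied lowest weight first; the types and names are illustrative, not the actual ECK ones:

```go
package main

import (
	"fmt"
	"sort"
)

// policy is a stand-in for a StackConfigPolicy; only the fields that
// matter for ordering are shown.
type policy struct {
	Name   string
	Weight int // defaults to 0 when unset
}

// sortByPrecedence orders policies lowest weight first, i.e. the order
// in which their settings are applied so that lower weight wins.
func sortByPrecedence(policies []policy) {
	sort.SliceStable(policies, func(i, j int) bool {
		return policies[i].Weight < policies[j].Weight
	})
}

func main() {
	ps := []policy{{Name: "kibana-only-policy", Weight: 9}, {Name: "elasticsearch-only-policy", Weight: 0}}
	sortByPrecedence(ps)
	fmt.Println(ps[0].Name) // elasticsearch-only-policy: weight 0 takes precedence
}
```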

Conflict Detection

Conflicts are detected across multiple dimensions and will prevent policy application:

| Conflict Type | Condition | Result |
| --- | --- | --- |
| Weight Conflict | Two or more policies with identical weights target the same Elasticsearch/Kibana | ❌ Conflict |
| SecretMount Name Conflict | Different policies define a SecretMount with the same SecretName | ❌ Conflict |
| SecretMount Path Conflict | Different policies define a SecretMount with the same MountPath | ❌ Conflict |
| Different Weights | Policies have different weights and none of the above applies | ✅ Pass - lower weight wins |

Important: Even if two policies with the same weight have non-overlapping resources, they still conflict because the weight collision makes the merge order ambiguous.
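
A sketch of this detection logic (again with illustrative types rather than the real ECK ones):

```go
package main

import "fmt"

type secretMount struct {
	SecretName string
	MountPath  string
}

type policy struct {
	Name         string
	Weight       int
	SecretMounts []secretMount
}

// detectConflicts returns an error for the first conflict found among
// policies targeting the same Elasticsearch or Kibana resource:
// duplicate weights, duplicate SecretMount secret names, or duplicate
// mount paths.
func detectConflicts(policies []policy) error {
	weights := map[int]string{}
	names := map[string]string{}
	paths := map[string]string{}
	for _, p := range policies {
		if other, ok := weights[p.Weight]; ok {
			return fmt.Errorf("policies %q and %q share weight %d", other, p.Name, p.Weight)
		}
		weights[p.Weight] = p.Name
		for _, m := range p.SecretMounts {
			if other, ok := names[m.SecretName]; ok {
				return fmt.Errorf("policies %q and %q mount the same secret %q", other, p.Name, m.SecretName)
			}
			names[m.SecretName] = p.Name
			if other, ok := paths[m.MountPath]; ok {
				return fmt.Errorf("policies %q and %q use mount path %q", other, p.Name, m.MountPath)
			}
			paths[m.MountPath] = p.Name
		}
	}
	return nil
}

func main() {
	err := detectConflicts([]policy{
		{Name: "a", Weight: 0},
		{Name: "b", Weight: 0}, // same weight, disjoint resources: still a conflict
	})
	fmt.Println(err)
}
```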

Configuration Merging Behaviour

Different merge strategies are applied based on the configuration type:

  • Deep Merge (recursive merging):

    • ClusterSettings
    • Config
    • SnapshotLifecyclePolicies
    • SecurityRoleMappings
    • IndexLifecyclePolicies
    • IngestPipelines
    • IndexTemplates.ComposableIndexTemplates
    • IndexTemplates.ComponentTemplates
  • Top-Level Key Replacement (entire keys replaced):

    • SnapshotRepositories - each repository configuration is treated atomically
  • Union Merge (with conflict detection):

    • SecretMounts - conflicts on duplicate SecretName OR duplicate MountPath
    • SecureSettings - merges by SecretName+Key, lower weight wins (no conflicts)
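
A sketch of the first two map-merge strategies under the lowest-weight-first application order described above; these helpers are illustrative, not the actual ECK merge code:

```go
package main

import "fmt"

// deepMerge recursively merges src (higher weight) into dst (the result
// of all lower-weight policies so far): nested maps are merged, new keys
// are added, and on scalar overlap the value already in dst wins.
func deepMerge(dst, src map[string]any) {
	for k, v := range src {
		if sub, ok := v.(map[string]any); ok {
			if dsub, ok := dst[k].(map[string]any); ok {
				deepMerge(dsub, sub)
				continue
			}
		}
		if _, exists := dst[k]; !exists {
			dst[k] = v
		}
	}
}

// replaceTopLevel treats each top-level key atomically: a key already
// present in dst completely shadows the same key in src.
func replaceTopLevel(dst, src map[string]any) {
	for k, v := range src {
		if _, exists := dst[k]; !exists {
			dst[k] = v
		}
	}
}

func main() {
	low := map[string]any{"indices": map[string]any{"lifecycle": map[string]any{"poll_interval": "10m"}}}
	high := map[string]any{"indices": map[string]any{"lifecycle": map[string]any{"poll_interval": "1h", "history_index_enabled": false}}}
	deepMerge(low, high)
	fmt.Println(low) // poll_interval stays "10m"; history_index_enabled is added

	reposLow := map[string]any{"repo-a": map[string]any{"type": "s3"}}
	reposHigh := map[string]any{"repo-a": map[string]any{"type": "gcs"}, "repo-b": map[string]any{"type": "fs"}}
	replaceTopLevel(reposLow, reposHigh)
	fmt.Println(reposLow) // repo-a keeps the lower-weight s3 config in full; repo-b is added
}
```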

Multi-Soft-Owner Secret Management

File Settings and Policy Config Secrets:

  • Now support multiple soft owners
  • Secrets are only deleted when all referencing soft-owners are removed
  • Uses the eck.k8s.elastic.co/owner-refs annotation, whose value is a JSON-encoded map of owner namespaced names

Secret Sources:

  • Remain single soft owner (existing behaviour unchanged)

This prevents secret leakage while enabling proper cleanup when policies are deleted.
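
A minimal sketch of the soft-owner bookkeeping on that annotation; the annotation key is from this PR, but the exact value schema (a map keyed by namespaced name) and the helper names are assumptions for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

const ownerRefsAnnotation = "eck.k8s.elastic.co/owner-refs"

// loadOwners decodes the soft-owner set from the annotation value.
// The map-of-bools value shape is an assumption, not the actual schema.
func loadOwners(annotations map[string]string) (map[string]bool, error) {
	owners := map[string]bool{}
	if raw, ok := annotations[ownerRefsAnnotation]; ok {
		if err := json.Unmarshal([]byte(raw), &owners); err != nil {
			return nil, err
		}
	}
	return owners, nil
}

// removeSoftOwner drops one owner from the annotation and reports
// whether no owners remain, i.e. whether the secret can be deleted.
func removeSoftOwner(annotations map[string]string, namespacedName string) (bool, error) {
	owners, err := loadOwners(annotations)
	if err != nil {
		return false, err
	}
	delete(owners, namespacedName)
	raw, err := json.Marshal(owners)
	if err != nil {
		return false, err
	}
	annotations[ownerRefsAnnotation] = string(raw)
	return len(owners) == 0, nil
}

func main() {
	ann := map[string]string{
		ownerRefsAnnotation: `{"elastic-system/elasticsearch-only-policy":true,"elastic-system/kibana-only-policy":true}`,
	}
	deletable, _ := removeSoftOwner(ann, "elastic-system/kibana-only-policy")
	fmt.Println(deletable) // false: one soft owner still references the secret
}
```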

Related Issues

@pkoutsovasilis pkoutsovasilis self-assigned this Nov 20, 2025
@pkoutsovasilis pkoutsovasilis added the >enhancement Enhancement of existing functionality label Nov 20, 2025
@prodsecmachine
Collaborator

prodsecmachine commented Nov 20, 2025

Snyk checks have passed. No issues have been found so far.

| Status | Scanner | Critical | High | Medium | Low | Total |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| ✅ | Licenses | 0 | 0 | 0 | 0 | 0 issues |


@github-actions

github-actions bot commented Nov 20, 2025

🔍 Preview links for changed docs

@kvalliyurnatt
Contributor

Just a general question/comment that came to my mind: do we want any limit on the number of SCPs that can be associated with a cluster?
I was trying to think through what issues we might face with a large number of SCPs (if that is even a practical scenario) associated with a single ES cluster, and whether there is a practical maximum we could enforce. One thing that came to mind at scale is the soft-owner annotation: could we run into some kind of limit on the annotation map size? (I believe the annotation limit is 256KB, which should not be something to worry about?)
Wondering if there are any other such things to consider.

Collaborator

@pebrc pebrc left a comment

I did a first pass, just looking at the code. I have not tested it yet. Will try to find some more time later today.

@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestStackConfigPolicy*

@pkoutsovasilis
Contributor Author

buildkite test this -f p=gke,t=TestStackConfigPolicy*

Collaborator

@pebrc pebrc left a comment

LGTM, nice work!

I think we have two follow-up items:

  1. improve the error/change attribution

The problem @barkbay mentioned earlier is worse for errors: they are displayed in the status resource of every contributing policy, and we currently leave it up to the user to trace back which source they came from. Can you maybe raise an issue for that?

NAMESPACE        NAME                        READY   PHASE             AGE   WEIGHT
elastic-system   elasticsearch-only-policy   1/2     ApplyingChanges   12m   0
elastic-system   kibana-only-policy          1/2     ApplyingChanges   12m   9
  2. documentation (needs to go into the docs-content repo)

@kvalliyurnatt
Contributor

@pkoutsovasilis wanted to follow up on my comment from earlier:

#8917 (comment)

Wondering if we should document what limits we might hit, and maybe do some scale testing to see at what number it becomes problematic? WDYT?

@pebrc
Collaborator

pebrc commented Dec 2, 2025

> @pkoutsovasilis wanted to follow up on my comment from earlier:
>
> #8917 (comment)
>
> Wondering if we should document what limits we might hit, and maybe do some scale testing to see at what number it becomes problematic? WDYT?

+1 to that. I tested only with a trivial number of SCPs (< 10). Maybe we should see whether our reconciliation algorithm runs into problems at higher cardinalities. This could inform the documentation we write as a follow-up and would allow us to give some guidance to users.

Just looking at the max annotation size for the soft owners, and assuming no other annotations exist, we could probably support ~2000 owners (63-char namespace + 63-char name ≈ 130 bytes per entry, plus 32 bytes for the key). While this is unlikely to be the limiting factor, I don't think we should advertise it as a limit. My gut feeling is that the total number of SCPs should be < 100, and the ones targeting a single ES cluster < 10. But it would be good to have more than gut-feel numbers 😄

@pkoutsovasilis
Contributor Author

pkoutsovasilis commented Dec 2, 2025

Thanks for the follow-up @kvalliyurnatt! To be honest, as @pebrc already captured in his comment above, I'm not particularly worried about the owner-refs annotation; those numbers are way beyond anything we would ever advertise. The limit that's more realistic to hit, I believe, is the stack config secret size: merging just two lengthy StackConfigPolicies could already push us past the 1MB size limit of a Secret. Everything else, apart from the stack config secret size and the owners annotation, is unchanged, so this PR doesn't make any other limit easier to exceed.

To accommodate these, my proposal is to cap the number of SCPs that can target a given stack component at 10 (which feels like a fairly high number to me). For the config policy secret size, we could check whether the error coming from here is an IsRequestEntityTooLargeError and, if so, set the status of the StackConfigPolicy accordingly. How does that sound, @kvalliyurnatt @pebrc?
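
For reference, a minimal sketch of that check, assuming the reconcile error is available as err; setStatus is a hypothetical stand-in for the operator's real status-update machinery:

```go
package main

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// classifyReconcileError maps a secret-reconciliation failure onto the
// StackConfigPolicy status instead of retrying blindly.
func classifyReconcileError(err error, setStatus func(phase, msg string)) error {
	if err == nil {
		return nil
	}
	if apierrors.IsRequestEntityTooLargeError(err) {
		setStatus("Error", "merged stack config secret exceeds the 1MB Secret size limit")
		return nil // requeueing won't help until the policies shrink
	}
	return err
}
```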

@pebrc
Collaborator

pebrc commented Dec 2, 2025

I would not put a limit in code. What I was suggesting was a bit more "scale" testing. The overall size limit of the settings file is well known, and this PR does not change anything in that regard. We are pretty confident that the annotation size is not a limitation in practice. What we do not know is how the controller behaves when very many SCPs need to be reconciled on each and every change: how many conflicts we will see (if any), and whether there are other issues we don't know of. So all I am arguing for is to test the limits of this implementation a bit before we give it to our users.


Labels

>enhancement Enhancement of existing functionality v3.3.0 (next)


Development

Successfully merging this pull request may close these issues.

Support multiple StackConfigPolicies per cluster
